A Bayesian model of bilingual segmentation for transliteration

نویسندگان

  • Andrew M. Finch
  • Eiichiro Sumita
چکیده

In this paper we propose a novel Bayesian model for unsupervised bilingual character sequence segmentation of corpora for transliteration. The system is based on a Dirichlet process model trained using Bayesian inference through blocked Gibbs sampling implemented using an efficient forward filtering/backward sampling dynamic programming algorithm. The Bayesian approach is able to overcome the overfitting problem inherent in maximum likelihood training. We demonstrate the effectiveness of our Bayesian segmentation by using it to build a translation model for a phrasebased statistical machine translation (SMT) system trained to perform transliteration by monotonic transduction from character sequence to character sequence. The Bayesian segmentation was used to construct a phrase-table and we compared the quality of this phrase-table to one generated in the usual manner by the state-of-the-art GIZA++ word alignment process used in combination with phrase extraction heuristics from the MOSES statistical machine translation system, by using both to perform transliteration generation within an identical framework. In our experiments on English-Japanese data from the NEWS2010 transliteration generation shared task, we used our technique to bilingually co-segment the training corpus. We then derived a phrasetable from the segmentation from the sample at the final iteration of the training procedure, and the resulting phrase-table was used to directly substitute for the phrase-table extracted by using GIZA++/MOSES. The phrase-table resulting from our Bayesian segmentation model was approximately 30% smaller than that produced by the SMT system’s training procedure, and gave an increase in transliteration quality measured in terms of both word accuracy and F-score.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The Application of Bayesian Alignment Techniques to Transliteration Generation and Mining

Bayesian techniques have recently been applied to many areas of natural language processing, and have proven themselves particularly useful in areas involving segmentation and alignment. This paper looks at the direct application of these techniques to the co-segmentation/alignment of grapheme sequences. We detail a novel Bayesian model for unsupervised bilingual character sequence alignment of...

متن کامل

A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration

Machine Transliteration is an essential task for many NLP applications. However, names and loan words typically originate from various languages, obey different transliteration rules, and therefore may benefit from being modeled independently. Recently, transliteration models based on Bayesian learning have overcome issues with over-fitting allowing for many-to-many alignment in the training of...

متن کامل

Noise-Aware Character Alignment for Bootstrapping Statistical Machine Transliteration from Bilingual Corpora

This paper proposes a novel noise-aware character alignment method for bootstrapping statistical machine transliteration from automatically extracted phrase pairs. The model is an extension of a Bayesian many-to-many alignment method for distinguishing nontransliteration (noise) parts in phrase pairs. It worked effectively in the experiments of bootstrapping Japanese-to-English statistical mach...

متن کامل

Dialect Translation: Integrating Bayesian Co-segmentation Models with Pivot-based SMT

Recent research on multilingual statistical machine translation (SMT) focuses on the usage of pivot languages in order to overcome resource limitations for certain language pairs. This paper proposes a new method to translate a dialect language into a foreign language by integrating transliteration approaches based on Bayesian co-segmentation (BCS) models with pivot-based SMT approaches. The ad...

متن کامل

Clustering and Classifying Person Names by Origin

In natural language processing, information about a person’s geographical origin is an important feature for name entity transliteration and question answering. We propose a language-independent name origin clustering and classification framework. Provided with a small amount of bilingual name translation pairs with labeled origins, we measure origin similarities based on the perplexities of na...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010